GDP and Life Expectancy Analysis

1. Load the Packages

library(ggplot2)
library(dplyr)
library(ggfortify)
library(factoextra)
library(ICSNP)
library(MASS)
library(glmnet)

2. Load the Data

gap.raw <- read.csv('./gap.csv')
gap <- gap.raw

3. Data Preprocessing

3.1. GDP

gap[,3:14] <- log(gap.raw[,3:14])
gdp <- gap[,3:14]
years <- seq(1952, 2007,5)
colnames(gdp) <- years
rownames(gdp) <- gap[,2]

Note: the log-transformed GDP columns are stored in gdp, and this log scale is used for the rest of the analysis.
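Why take logs at all? GDP values are strongly right-skewed, and the log transform makes them far more symmetric. The sketch below illustrates this on simulated lognormal data (not the gap dataset), using the mean-minus-median gap as a crude symmetry measure:

```r
# Illustrative sketch (simulated data, not the gap dataset): logging
# right-skewed GDP-like values makes the distribution far more symmetric.
set.seed(1)
gdp_like <- rlnorm(1000, meanlog = 8, sdlog = 1.2)    # heavy right tail
raw_gap <- mean(gdp_like) - median(gdp_like)          # large positive gap
log_gap <- mean(log(gdp_like)) - median(log(gdp_like))  # near zero
c(raw = raw_gap, log = log_gap)
```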

3.2. Life Expectancy

lifeExp <- gap[,15:26]
colnames(lifeExp) <- years
rownames(lifeExp) <- gap[,2]

4. Exploratory Data Analysis

4.1. GDP

plot(years, colMeans(exp(gdp)), type='l', xlab='Year', ylab='Average GDP', main='Average GDP over time')

gdp.cont <- aggregate(gdp, by=list(gap$continent), FUN=mean)
plot(years, as.numeric(gdp.cont[1,2:13]), type='l', xlab='Year', ylab='Average GDP (log)', main='GDP (log) by continent over time', col=1, ylim=c(6.5, 10.5))
for(i in 2:5){  # five continents in total
  lines(years, as.numeric(gdp.cont[i,2:13]), col=i)
}
legend('bottomright', legend=gdp.cont[,1], col=1:5, lty=1)

4.2. Life Expectancy

plot(years, colMeans(lifeExp), type='l', xlab='Year', ylab='Average Life Expectancy', main='Average Life Expectancy over time')

lifeExp.cont <- aggregate(lifeExp, by=list(gap$continent), FUN=mean)

plot(years, as.numeric(lifeExp.cont[1,2:13]), type='l', xlab='Year', ylab='Average Life Expectancy', main='Life Expectancy by continent over time', col=1, ylim=c(30, 80))
for(i in 2:5){  # five continents in total
  lines(years, as.numeric(lifeExp.cont[i,2:13]), col=i)
}
legend('bottomright', legend=lifeExp.cont[,1], col=1:5, lty=1)

5. Analysis

5.1. Principal Component Analysis

5.1.1. GDP

gdp.pca.S <- prcomp(gdp, scale=FALSE)
gdp.pca.S.scores <- gdp.pca.S$x
summary(gdp.pca.S)
## Importance of components:
##                           PC1    PC2     PC3    PC4     PC5     PC6     PC7
## Standard deviation     4.0608 0.8718 0.38139 0.2215 0.16930 0.11366 0.09388
## Proportion of Variance 0.9416 0.0434 0.00831 0.0028 0.00164 0.00074 0.00050
## Cumulative Proportion  0.9416 0.9850 0.99328 0.9961 0.99771 0.99845 0.99895
##                            PC8     PC9    PC10    PC11    PC12
## Standard deviation     0.07325 0.06644 0.05813 0.05649 0.04419
## Proportion of Variance 0.00031 0.00025 0.00019 0.00018 0.00011
## Cumulative Proportion  0.99926 0.99951 0.99971 0.99989 1.00000
cum.variance.S <- summary(gdp.pca.S)$importance[3,]
variance.S <- summary(gdp.pca.S)$importance[2,]
plot(cum.variance.S, type='l', xlab='Number of Principal Components', ylab='Cumulative Variance', main='Cumulative Variance of GDP (log)')

Plotting the variance explained by each principal component (scree plot)

plot(gdp.pca.S, main='Variances of the GDP (log) principal components')

According to the plot above, the first three principal components explain 99.3% of the variance in the data. Therefore, we will use the first three principal components for the analysis.

The loadings mix positive and negative values, so the components are hard to interpret directly. Instead, we plot the scores against each other to look for patterns.

gdp.pca.S.scores <- data.frame(gdp.pca.S.scores)
gdp.pca.S.scores$country <- rownames(gdp.pca.S.scores)
gdp.pca.S.scores$continent <- gap$continent
plt <- ggplot(gdp.pca.S.scores, aes(PC1, PC2, colour=continent)) 
plt <- plt + geom_point()
plt <- plt + geom_text(aes(label=country), size=2) 
plt <- plt + ggtitle('GDP PCA', subtitle='PC1 vs PC2')
plt <- plt + theme(
  plot.title = element_text(hjust = 0.5),
  plot.subtitle = element_text(hjust = 0.5)
)
plt <- plt + xlab(paste('PC1 (', round(variance.S[1]*100, 2), '%)', sep=''))
plt <- plt + ylab(paste('PC2 (', round(variance.S[2]*100, 2), '%)', sep=''))
print(plt)

plt <- ggplot(gdp.pca.S.scores, aes(PC1, PC3, colour=continent)) 
plt <- plt + geom_point()
plt <- plt + geom_text(aes(label=country), size=2) 
plt <- plt + ggtitle('GDP PCA', subtitle='PC1 vs PC3')
plt <- plt + theme(
  plot.title = element_text(hjust = 0.5),
  plot.subtitle = element_text(hjust = 0.5)
)
plt <- plt + xlab(paste('PC1 (', round(variance.S[1]*100, 2), '%)', sep=''))
plt <- plt + ylab(paste('PC3 (', round(variance.S[3]*100, 2), '%)', sep=''))
print(plt)

plt <- ggplot(gdp.pca.S.scores, aes(PC2, PC3, colour=continent)) 
plt <- plt + geom_point()
plt <- plt + geom_text(aes(label=country), size=2) 
plt <- plt + ggtitle('GDP PCA', subtitle='PC2 vs PC3')
plt <- plt + theme(
  plot.title = element_text(hjust = 0.5),
  plot.subtitle = element_text(hjust = 0.5)
)
plt <- plt + xlab(paste('PC2 (', round(variance.S[2]*100, 2), '%)', sep=''))
plt <- plt + ylab(paste('PC3 (', round(variance.S[3]*100, 2), '%)', sep=''))
print(plt)

As the plots show, the first two principal components separate the countries fairly well by continent: African countries fall mostly on the right side of the plot, while European and Oceanian countries fall mostly on the left.

The PC1 vs PC3 plot shows the same pattern, whereas the PC2 vs PC3 plot shows no clear separation by continent.

Among these countries, Kuwait stands apart: it lies far from the rest, with a low PC1 score and high PC2 and PC3 scores, because its GDP is much higher than that of the other countries.
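Such outliers can also be flagged numerically rather than by eye. The helper below is a sketch (outlier_rank is a name introduced here): it ranks observations by their Euclidean distance in standardised score space, and applied to the first three columns of gdp.pca.S.scores it would likely place Kuwait near the top. The demo plants an obvious outlier in simulated scores:

```r
# Sketch: rank observations by distance in standardised score space.
outlier_rank <- function(scores, k = 3) {
  z <- scale(as.matrix(scores[, 1:k]))   # standardise the first k PCs
  order(sqrt(rowSums(z^2)), decreasing = TRUE)
}

# Demo: row 1 is an obvious outlier in simulated scores.
set.seed(1)
S <- matrix(rnorm(150), ncol = 3)
S[1, ] <- c(20, 20, 20)
outlier_rank(S)[1]  # 1
```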

5.1.2. Life Expectancy

lifeExp.pca.S <- prcomp(lifeExp, scale=FALSE)
lifeExp.pca.S.scores <- lifeExp.pca.S$x
summary(lifeExp.pca.S)
## Importance of components:
##                            PC1     PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     38.6350 9.83955 4.50873 2.52549 1.39085 1.27733 1.01896
## Proportion of Variance  0.9203 0.05969 0.01253 0.00393 0.00119 0.00101 0.00064
## Cumulative Proportion   0.9203 0.97994 0.99247 0.99641 0.99760 0.99860 0.99924
##                            PC8     PC9    PC10    PC11    PC12
## Standard deviation     0.79691 0.48377 0.43319 0.33836 0.23486
## Proportion of Variance 0.00039 0.00014 0.00012 0.00007 0.00003
## Cumulative Proportion  0.99964 0.99978 0.99990 0.99997 1.00000
cum.variance.S <- summary(lifeExp.pca.S)$importance[3,]
variance.S <- summary(lifeExp.pca.S)$importance[2,]
plot(cum.variance.S, type='l', xlab='Number of Principal Components', ylab='Cumulative Variance', main='Cumulative Variance of Life Expectancy')

Plotting the variance explained by each principal component (scree plot)

plot(lifeExp.pca.S, main='Variances of the life expectancy principal components')

According to the plot above, the first three principal components explain 99.2% of the variance in the data. Therefore, we will use the first three principal components for the analysis.

lifeExp.pca.S.scores <- data.frame(lifeExp.pca.S.scores)
lifeExp.pca.S.scores$country <- rownames(lifeExp.pca.S.scores)
lifeExp.pca.S.scores$continent <- gap$continent
plt <- ggplot(lifeExp.pca.S.scores, aes(PC1, PC2, colour=continent)) 
plt <- plt + geom_point()
plt <- plt + geom_text(aes(label=country), size=2) 
plt <- plt + ggtitle('Life Expectancy PCA', subtitle='PC1 vs PC2')
plt <- plt + theme(
  plot.title = element_text(hjust = 0.5),
  plot.subtitle = element_text(hjust = 0.5)
)
plt <- plt + xlab(paste('PC1 (', round(variance.S[1]*100, 2), '%)', sep=''))
plt <- plt + ylab(paste('PC2 (', round(variance.S[2]*100, 2), '%)', sep=''))
print(plt)

plt <- ggplot(lifeExp.pca.S.scores, aes(PC1, PC3, colour=continent)) 
plt <- plt + geom_point()
plt <- plt + geom_text(aes(label=country), size=2) 
plt <- plt + ggtitle('Life Expectancy PCA', subtitle='PC1 vs PC3')
plt <- plt + theme(
  plot.title = element_text(hjust = 0.5),
  plot.subtitle = element_text(hjust = 0.5)
)
plt <- plt + xlab(paste('PC1 (', round(variance.S[1]*100, 2), '%)', sep=''))
plt <- plt + ylab(paste('PC3 (', round(variance.S[3]*100, 2), '%)', sep=''))
print(plt)

plt <- ggplot(lifeExp.pca.S.scores, aes(PC2, PC3, colour=continent)) 
plt <- plt + geom_point()
plt <- plt + geom_text(aes(label=country), size=2) 
plt <- plt + ggtitle('Life Expectancy PCA', subtitle='PC2 vs PC3')
plt <- plt + theme(
  plot.title = element_text(hjust = 0.5),
  plot.subtitle = element_text(hjust = 0.5)
)
plt <- plt + xlab(paste('PC2 (', round(variance.S[2]*100, 2), '%)', sep=''))
plt <- plt + ylab(paste('PC3 (', round(variance.S[3]*100, 2), '%)', sep=''))
print(plt)

As the plots show, the first two principal components again separate the countries fairly well by continent: African countries fall mostly on the left side of the plot, while European and Oceanian countries fall mostly on the right.

The PC1 vs PC3 plot shows the same pattern, whereas the PC2 vs PC3 plot shows no clear separation by continent.

Among these countries, Zimbabwe and Zambia stand apart: they lie far from the rest, with low PC1 and high PC2 and PC3 scores, because their life expectancy is much lower than that of the other countries.

5.2. Multidimensional Scaling (MDS)

Performing MDS on the combined dataset of GDP and Life Expectancy.

combined.ds <- gap[,3:26]
n <- nrow(combined.ds)
H <- diag(n) - matrix(1, n, n)/n  # centering matrix: I - (1/n) 1 1'
combined.ds <- H %*% as.matrix(combined.ds)
distances <- dist(combined.ds)
mds <- cmdscale(distances, k=2)

Now we will plot a 2-dimensional representation of the data.

mds <- data.frame(mds)
mds$country <- gap$country
mds$continent <- gap$continent
plt <- ggplot(mds, aes(X1, X2, colour=continent))
plt <- plt + geom_point()
plt <- plt + geom_text(aes(label=country), size=2) 
plt <- plt + ggtitle('Multidimensional scaling')
plt <- plt + theme(
  plot.title = element_text(hjust = 0.5),
  plot.subtitle = element_text(hjust = 0.5)
)
plt <- plt + xlab('V1')
plt <- plt + ylab('V2')
print(plt)

The plot above shows the countries separating by continent: African countries fall mostly on the left side, while European and Oceanian countries fall mostly on the right, much like the PCA plot of life expectancy.
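The resemblance to the PCA plots is no coincidence: classical MDS applied to Euclidean distances of (centered) data reproduces the PCA scores up to sign. A small self-contained check on random data:

```r
# Sketch: classical MDS with Euclidean distances recovers the PCA
# scores up to sign flips, which is why this plot echoes the PCA plots.
set.seed(1)
X <- matrix(rnorm(30 * 5), ncol = 5)
mds2 <- cmdscale(dist(X), k = 2)   # cmdscale double-centers internally
pca2 <- prcomp(X)$x[, 1:2]         # prcomp centers by default
max(abs(abs(mds2) - abs(pca2)))    # essentially zero
```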

5.3. Hypothesis Testing

Now we will conduct a hypothesis test comparing Asian and European countries in 1952 and in 2007. The null hypothesis is that the mean GDP (respectively, mean life expectancy) is the same for the two continents; the alternative is that the means differ. Note that each test uses a single year, so each comparison is effectively univariate.

gdp.2007 <- gdp %>% dplyr::select('2007')
lifeExp.2007 <- lifeExp %>% dplyr::select('2007')
gdp.2007$continent <- gap$continent
lifeExp.2007$continent <- gap$continent

gdp.2007.asia <- data.frame(gdp.2007[gdp.2007$continent == 'Asia',1])
lifeExp.2007.asia <- data.frame(lifeExp.2007[lifeExp.2007$continent == 'Asia',1])
gdp.2007.europe <- data.frame(gdp.2007[gdp.2007$continent == 'Europe',1])
lifeExp.2007.europe <- data.frame(lifeExp.2007[lifeExp.2007$continent == 'Europe',1])
gdp.1952 <- gdp %>% dplyr::select('1952')
lifeExp.1952 <- lifeExp %>% dplyr::select('1952')
gdp.1952$continent <- gap$continent
lifeExp.1952$continent <- gap$continent

gdp.1952.asia <- data.frame(gdp.1952[gdp.1952$continent == 'Asia',1])
lifeExp.1952.asia <- data.frame(lifeExp.1952[lifeExp.1952$continent == 'Asia',1])
gdp.1952.europe <- data.frame(gdp.1952[gdp.1952$continent == 'Europe',1])
lifeExp.1952.europe <- data.frame(lifeExp.1952[lifeExp.1952$continent == 'Europe',1])

5.3.1. GDP

HotellingsT2(gdp.2007.asia, gdp.2007.europe)
## 
##  Hotelling's two sample T2-test
## 
## data:  gdp.2007.asia and gdp.2007.europe
## T.2 = 25.215, df1 = 1, df2 = 61, p-value = 4.747e-06
## alternative hypothesis: true location difference is not equal to c(0)
qf(0.95, 1, 61)
## [1] 3.998494

The Hotelling's T2 statistic is 25.215, which exceeds the 5% critical value F(0.95; 1, 61) ≈ 4.00, so we reject the null hypothesis: the mean (log) GDP of Asian and European countries differed in 2007.
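Since each comparison involves a single variable, this T2 is simply the square of the pooled two-sample t statistic. A hedged self-contained check on simulated data:

```r
# Sketch: with one variable, Hotelling's two-sample T2 equals the
# square of the pooled-variance two-sample t statistic.
set.seed(1)
x <- rnorm(30); y <- rnorm(33, mean = 1)
n1 <- length(x); n2 <- length(y)
sp2 <- ((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2)  # pooled variance
T2 <- (mean(x) - mean(y))^2 / (sp2 * (1 / n1 + 1 / n2))
t2 <- unname(t.test(x, y, var.equal = TRUE)$statistic^2)
all.equal(T2, t2)  # TRUE
```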

Let’s compare these two continents in 1952.

HotellingsT2(gdp.1952.asia, gdp.1952.europe)
## 
##  Hotelling's two sample T2-test
## 
## data:  gdp.1952.asia and gdp.1952.europe
## T.2 = 24.088, df1 = 1, df2 = 61, p-value = 7.196e-06
## alternative hypothesis: true location difference is not equal to c(0)
qf(0.95, 1, 61)
## [1] 3.998494

The Hotelling's T2 statistic is 24.088, again above the critical value of about 4.00, so we reject the null hypothesis: the mean (log) GDP also differed in 1952, although the slightly smaller statistic suggests the two continents were a little more similar in 1952 than in 2007.

5.3.2. Life Expectancy

HotellingsT2(lifeExp.2007.asia, lifeExp.2007.europe)
## 
##  Hotelling's two sample T2-test
## 
## data:  lifeExp.2007.asia and lifeExp.2007.europe
## T.2 = 20.072, df1 = 1, df2 = 61, p-value = 3.34e-05
## alternative hypothesis: true location difference is not equal to c(0)
qf(0.95, 1, 61)
## [1] 3.998494

The Hotelling's T2 statistic is 20.072, above the critical value of about 4.00, so we reject the null hypothesis: the mean life expectancy of Asian and European countries differed in 2007.

Let’s compare these two continents in 1952.

HotellingsT2(lifeExp.1952.asia, lifeExp.1952.europe)
## 
##  Hotelling's two sample T2-test
## 
## data:  lifeExp.1952.asia and lifeExp.1952.europe
## T.2 = 79.73, df1 = 1, df2 = 61, p-value = 1.129e-12
## alternative hypothesis: true location difference is not equal to c(0)
qf(0.95, 1, 61)
## [1] 3.998494

The Hotelling's T2 statistic is 79.73, far above the critical value of about 4.00, so we reject the null hypothesis: the mean life expectancy also differed in 1952. The much larger statistic in 1952 indicates the two continents were more similar in 2007 than in 1952, the opposite of the pattern seen for GDP.

5.4. Linear Discriminant Analysis (LDA)

We will now look at whether linear discriminant analysis can successfully separate the continents. We use lda() from MASS to predict the continent of a country from its GDP and life expectancy trajectories.

set.seed(1221)
# combine gdp and life expectancy
combined.ds <- gap[,3:26]
n <- nrow(combined.ds)
test_index <- sample(1:n, 0.2*n)
train <- combined.ds[-test_index,]
test <- combined.ds[test_index,]
lda.model.train <- lda(train, gap$continent[-test_index])
lda.model.test <- predict(lda.model.train, test)

print(paste('LDA accuracy:', sum(lda.model.test$class == gap$continent[test_index])/length(lda.model.test$class)*100, '%', sep=''))
## [1] "LDA accuracy:64.2857142857143%"
table(lda.model.test$class, gap$continent[test_index])
##           
##            Africa Americas Asia Europe
##   Africa        8        0    1      0
##   Americas      0        1    1      0
##   Asia          0        2    5      0
##   Europe        0        4    2      4
##   Oceania       0        0    0      0
proj <- predict(lda.model.train)$x[, 1:2]  # discriminant scores on the training set
proj.df <- data.frame(proj, continent = gap$continent[-test_index])
ggplot(proj.df, aes(x = LD1, y = LD2, color = continent)) + 
  geom_point() +
  labs(x = "LD1", y = "LD2", color = "Continent")

proj <- predict(lda.model.train, test)$x[, 1:2]  # discriminant scores on the test set
proj.df <- data.frame(proj, continent = gap$continent[test_index])
ggplot(proj.df, aes(x = LD1, y = LD2, color = continent)) + 
  geom_point() +
  labs(x = "LD1", y = "LD2", color = "Continent")

As the plots above show, the LDA model separates the continents reasonably well using GDP and life expectancy. The projections are somewhat similar to the earlier PCA plots, but life expectancy appears to contribute more than GDP to the separation here.

5.5. Clustering

5.5.1. K-Means

We will now use K-Means clustering to cluster the countries based on their GDP and life expectancy.
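One caveat worth noting: kmeans() depends on its random initialisation, so a single run can land in a poor local optimum. A hedged sketch on simulated data of the usual remedy, the nstart argument, which runs several starts and keeps the best solution:

```r
# Sketch: nstart runs several random initialisations and keeps the
# solution with the lowest total within-cluster sum of squares.
set.seed(1221)
X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))
km <- kmeans(X, centers = 2, nstart = 25)
table(km$cluster)  # two clusters of roughly 50 points each
```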

5.5.1.1. GDP
set.seed(1221)
gdp.kmeans <- kmeans(gdp, centers = 3)
table(gdp.kmeans$cluster, gap$continent)
##    
##     Africa Americas Asia Europe Oceania
##   1      2        5    8     24       2
##   2     37        1   13      0       0
##   3     13       19   12      6       0
fviz_cluster(gdp.kmeans, data = gdp, ellipse.type = "norm", 
             palette = "jco", ggtheme = theme_minimal())

fviz_nbclust(gdp, kmeans, method = "wss")

5.5.1.2. Life Expectancy
set.seed(1221)
lifeExp.kmeans <- kmeans(lifeExp, centers = 3)
table(lifeExp.kmeans$cluster, gap$continent)
##    
##     Africa Americas Asia Europe Oceania
##   1      2       14    9     29       2
##   2     34        1    5      0       0
##   3     16       10   19      1       0
fviz_cluster(lifeExp.kmeans, data = lifeExp, ellipse.type = "norm", 
             palette = "jco", ggtheme = theme_minimal())

fviz_nbclust(lifeExp, kmeans, method = "wss")

We use the elbow method to choose the number of clusters. For both GDP and life expectancy the elbow suggests 3 clusters, but the resulting clusters correspond only loosely to the continents.
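For reference, the numbers behind the fviz_nbclust(..., method = "wss") plots can be computed directly; the sketch below does so on simulated data by evaluating the total within-cluster sum of squares for each candidate k:

```r
# Sketch: the elbow curve computed by hand - total within-cluster
# sum of squares for k = 1..8 on simulated data.
set.seed(1221)
X <- matrix(rnorm(200), ncol = 2)
wss <- sapply(1:8, function(k) kmeans(X, centers = k, nstart = 10)$tot.withinss)
plot(1:8, wss, type = "b", xlab = "Number of clusters k",
     ylab = "Total within-cluster SS")
```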

5.5.2. Hierarchical Clustering

gap.scaled <- gap
gap.scaled[,3:26] <- scale(gap[,3:26]) 
gap.dist <- dist(gap.scaled[,3:26], method = "euclidean")
plot(hclust(gap.dist, method = "complete"), hang = -1, cex = 0.6, main = "Complete Linkage Clustering", labels = gap$country)

plot(hclust(gap.dist, method = "single"), hang = -1, cex = 0.6, main = "Single Linkage Clustering", labels = gap$country)

plot(hclust(gap.dist, method = "average"), hang = -1, cex = 0.6, main = "Average Linkage Clustering", labels = gap$country)

The plots above show hierarchical clusterings of the countries based on their GDP and life expectancy. Complete linkage appears to give the most interpretable dendrogram, but again the clusters correspond only loosely to the continents.
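The dendrograms are hard to compare with the continents by eye; cutree() converts a tree into k discrete labels that can be cross-tabulated. With the objects above one could tabulate cutree(hclust(gap.dist, method = "complete"), k = 5) against gap$continent; the self-contained sketch below shows the idea on simulated two-group data:

```r
# Sketch (simulated data): cutree() turns a dendrogram into k discrete
# cluster labels that can be cross-tabulated with known groups.
set.seed(1221)
X <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 5), ncol = 2))
truth <- rep(c("A", "B"), each = 30)
clusters <- cutree(hclust(dist(X), method = "complete"), k = 2)
table(clusters, truth)
```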

5.6. Linear Regression

We will now use linear regression to predict 2007 life expectancy from the (log) GDP values.

lifeExp.2007 <- lifeExp %>% dplyr::select('2007')
df <- cbind(gdp, lifeExp.2007)
colnames(df)[ncol(df)] <- 'target'
set.seed(1221)
n <- nrow(df)
test_index <- sample(1:n, 0.2*n)
train <- df[-test_index,]
test <- df[test_index,]
model.lm <- lm(target ~ ., data = train)
summary(model.lm)
## 
## Call:
## lm(formula = target ~ ., data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.9286  -2.1059   0.5705   3.8847  12.0786 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.5357     5.4798   0.645  0.52024   
## `1952`       -2.8641     7.2310  -0.396  0.69287   
## `1957`       12.1431    11.0003   1.104  0.27227   
## `1962`       -8.1736    12.0940  -0.676  0.50069   
## `1967`        3.0511     8.7295   0.350  0.72743   
## `1972`       -8.0452     7.5957  -1.059  0.29204   
## `1977`        0.7646     6.7455   0.113  0.90997   
## `1982`       -1.5963     8.6317  -0.185  0.85365   
## `1987`        8.2826     7.3898   1.121  0.26502   
## `1992`      -11.2368     6.7644  -1.661  0.09978 . 
## `1997`       22.5288     8.3505   2.698  0.00818 **
## `2002`      -15.3694     9.1163  -1.686  0.09490 . 
## `2007`        8.0100     5.3722   1.491  0.13907   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.139 on 101 degrees of freedom
## Multiple R-squared:  0.7003, Adjusted R-squared:  0.6647 
## F-statistic: 19.67 on 12 and 101 DF,  p-value: < 2.2e-16
pred <- predict(model.lm, test)
print(paste('RMSE:', sqrt(mean((pred - test$target)^2))))
## [1] "RMSE: 6.40057331119658"

The RMSE on the test set is 6.40 for the linear regression model. Note that the log-GDP columns for consecutive years are highly correlated, so the individual coefficients above are unstable even though the overall fit is reasonable.

Let's try Ridge regression to see if regularisation improves the RMSE.

train.matrix <- as.matrix(train[,-ncol(train)])
test.matrix <- as.matrix(test[,-ncol(test)])
train.target <- train$target
test.target <- test$target
model.ridge <- glmnet(train.matrix, train.target, alpha = 0)
plot(model.ridge, xvar = "lambda", label = FALSE)

lambdas <- 10^seq(3,-2,by=-0.1)
model.ridge.cv <- cv.glmnet(train.matrix, train.target, alpha = 0, lambda = lambdas)
plot(model.ridge.cv)

min.lambda <- model.ridge.cv$lambda.min
min.1se.lambda <- model.ridge.cv$lambda.1se
print(paste('Min lambda:', min.lambda))
## [1] "Min lambda: 1.99526231496888"
print(paste('Min 1se lambda:', min.1se.lambda))
## [1] "Min 1se lambda: 39.8107170553497"
coef <- coef(model.ridge.cv, s = model.ridge.cv$lambda.min)
print(paste('Number of coefficients:', length(coef[coef != 0])))
## <sparse>[ <logic> ]: .M.sub.i.logical() maybe inefficient
## [1] "Number of coefficients: 13"
pred <- predict(model.ridge.cv, test.matrix, s = model.ridge.cv$lambda.min)
print(paste('RMSE:', sqrt(mean((pred - test.target)^2))))
## [1] "RMSE: 6.34290896051329"

The RMSE on the test set is 6.34 for the Ridge regression model, slightly better than plain linear regression. All 13 coefficients (the intercept plus the 12 year features) remain non-zero, as expected: Ridge shrinks coefficients towards zero but does not eliminate them. Let's try Lasso regression to see if a sparser model improves the RMSE.

model.lasso <- glmnet(train.matrix, train.target, alpha = 1)
plot(model.lasso, xvar = "lambda", label = FALSE)

lambdas <- 10^seq(3,-2,by=-0.1)
model.lasso.cv <- cv.glmnet(train.matrix, train.target, alpha = 1, lambda = lambdas)
plot(model.lasso.cv)

min.lambda <- model.lasso.cv$lambda.min
min.1se.lambda <- model.lasso.cv$lambda.1se
print(paste('Min lambda:', min.lambda))
## [1] "Min lambda: 0.398107170553497"
print(paste('Min 1se lambda:', min.1se.lambda))
## [1] "Min 1se lambda: 3.16227766016838"
coef <- coef(model.lasso.cv, s = model.lasso.cv$lambda.min)
print(paste('Number of coefficients:', length(coef[coef != 0])))
## <sparse>[ <logic> ]: .M.sub.i.logical() maybe inefficient
## [1] "Number of coefficients: 3"
pred <- predict(model.lasso.cv, test.matrix, s = model.lasso.cv$lambda.min)
print(paste('RMSE:', sqrt(mean((pred - test.target)^2))))
## [1] "RMSE: 6.27065071067278"

The RMSE on the test set is 6.27 for the Lasso regression model, the best of the three. Only 3 coefficients are non-zero (the intercept plus two of the 12 year features), so Lasso selects a much sparser model.
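Incidentally, the "<sparse>" warnings printed above come from logical subsetting of the sparse Matrix returned by coef(); converting to a dense vector first avoids them. A small sketch (count_nonzero is a helper name introduced here; one could pass it coef(model.lasso.cv, s = model.lasso.cv$lambda.min)):

```r
# Sketch: count non-zero coefficients by converting to a dense numeric
# vector first, avoiding the sparse-matrix subsetting warning.
count_nonzero <- function(coefs) sum(as.numeric(coefs) != 0)
count_nonzero(c(1.2, 0, -0.8, 0, 0))  # 2
```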

5.7. Conclusion

In this project, we analysed the Gapminder dataset. Both GDP and life expectancy have increased over the years, and life expectancy is positively associated with GDP. The PCA, MDS, and LDA projections all separate the countries broadly by continent, while the unsupervised clusterings match the continents only loosely.
